feat: add persistent custom operator registry #968
cmgzn wants to merge 8 commits into datajuicer:main
Code Review
This pull request introduces a persistent custom operator registry, enabling externally developed operators to be registered and automatically loaded across sessions. Key changes include the addition of data_juicer.utils.custom_op for registry management, an enhanced CLI for the operator search tool, and improved parameter handling in MCP tool generation. Review feedback identifies safety concerns regarding the manual cleanup of sys.modules during operator unregistration and highlights inconsistencies between the documentation and implementation regarding registry filenames and environment variables. Additionally, more robust error handling was recommended for resolving operator source paths.
    except Exception as e:
        # Clean up partially-initialized module to avoid stale entries
        sys.modules.pop(module_name, None)
        raise RuntimeError(f"Error loading '{abs_path}' as '{module_name}': {e}")
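Check whether we also need to roll back OPERATORS here: if the module registered any operators before it raised, those entries would remain in the registry after the module is removed from sys.modules.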
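A minimal sketch of what such a rollback could look like, assuming the registry behaves like a plain dict. The `toy_registry` module, its `OPERATORS` dict, and `load_with_rollback` are hypothetical stand-ins for illustration, not the code in this PR:

```python
import importlib.util
import sys
import tempfile
import types
from pathlib import Path

# Hypothetical stand-in for the real registry, exposed as an importable module
# so that the throwaway operator file below can register into it.
toy_registry = types.ModuleType("toy_registry")
toy_registry.OPERATORS = {"builtin_op": object}
sys.modules["toy_registry"] = toy_registry


def load_with_rollback(abs_path, module_name, registry):
    """Import a file; on failure, undo both sys.modules and registry changes."""
    snapshot = dict(registry)  # shallow copy taken before the import runs
    spec = importlib.util.spec_from_file_location(module_name, abs_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    try:
        spec.loader.exec_module(module)
    except Exception as e:
        sys.modules.pop(module_name, None)
        # Restore in place so other references to the registry stay valid.
        registry.clear()
        registry.update(snapshot)
        raise RuntimeError(
            f"Error loading '{abs_path}' as '{module_name}': {e}"
        ) from e
    return module


# Demo: a module that registers an operator and then fails halfway through.
load_error = None
with tempfile.TemporaryDirectory() as tmp:
    bad_file = Path(tmp) / "bad_op.py"
    bad_file.write_text(
        "from toy_registry import OPERATORS\n"
        "OPERATORS['half_done_op'] = object\n"
        "raise ValueError('boom')\n"
    )
    try:
        load_with_rollback(str(bad_file), "bad_op", toy_registry.OPERATORS)
    except RuntimeError as err:
        load_error = err
```

Snapshotting the whole dict is the simplest option; a real registry might prefer to track only the names added during the import.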
ShenQianli left a comment
The persistence model is at the operator level, but the real loading unit is the module/package path. That makes it unreliable for package-based custom ops, relative imports, and multi-operator modules. I think persistence should be path-based instead.
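To make the suggestion concrete, here is a rough sketch of a path-keyed registry file. The JSON layout (`{"paths": [...]}`) and the helper names `register_path` and `unregister_path` are assumptions for illustration only, not what the PR implements:

```python
import json
import tempfile
from pathlib import Path


def _read(registry_file):
    """Return the stored data, or an empty registry if the file is missing."""
    registry_file = Path(registry_file)
    if registry_file.exists():
        return json.loads(registry_file.read_text())
    return {"paths": []}


def register_path(registry_file, source_path):
    """Record a module file or package directory; repeated calls are no-ops."""
    data = _read(registry_file)
    resolved = str(Path(source_path).resolve())
    if resolved not in data["paths"]:
        data["paths"].append(resolved)
    Path(registry_file).write_text(json.dumps(data, indent=2))
    return data


def unregister_path(registry_file, source_path):
    """Forget a path; every operator it defines goes away with it."""
    data = _read(registry_file)
    resolved = str(Path(source_path).resolve())
    data["paths"] = [p for p in data["paths"] if p != resolved]
    Path(registry_file).write_text(json.dumps(data, indent=2))
    return data


# Demo: one single-file module and one package, each stored as a single entry
# regardless of how many operators they define.
with tempfile.TemporaryDirectory() as tmp:
    reg = Path(tmp) / "custom_op.json"
    register_path(reg, Path(tmp) / "my_ops.py")
    register_path(reg, Path(tmp) / "my_ops.py")  # duplicate, ignored
    after_register = register_path(reg, Path(tmp) / "my_pkg")
    after_unregister = unregister_path(reg, Path(tmp) / "my_ops.py")
```

With this layout, the unit that is persisted matches the unit that is imported, so a module defining several operators needs only one entry.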
2d1c83c to 21f6185
… instead of operator names
21f6185 to 1e075cd
Summary
Add a persistent JSON-based registry (~/.data_juicer/custom_op.json) so that user-defined custom operators survive across processes without requiring re-registration on every run.
Motivation
Previously, custom operators had to be re-loaded via config every time a process started. This made it cumbersome to work with reusable custom ops across sessions, scripts, and CLI invocations.
Changes
- data_juicer/utils/custom_op.py (new): Core module for persistent custom op management:
  - ~/.data_juicer/custom_op.json storing source paths keyed by op name
  - load_persistent_custom_ops() replays registrations on startup, auto-cleaning stale entries
  - python -m data_juicer.utils.custom_op {list,register,unregister,reset}
  - config.py
- data_juicer/utils/registry.py: Add unregister_module() to Registry class
- data_juicer/ops/__init__.py: Call load_persistent_custom_ops() at import time after built-in ops are loaded
- data_juicer/config/config.py: Replace inline loading logic with a re-export from custom_op for backward compatibility
- data_juicer/tools/op_search.py: Harden OPRecord to handle custom ops with non-standard module paths, missing source files, and absent test files
- data_juicer/tools/DJ_mcp_granular_ops.py: Adapt MCP tooling for enhanced OPRecord fields
- docs/DeveloperGuide.md, docs/DeveloperGuide_ZH.md: Document the new persistent registration workflow
Testing
- tests/utils/test_custom_op.py
- tests/tools/test_op_search.py
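For readers unfamiliar with the replay-and-clean behaviour described under Changes, the idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `replay_registry` and the `loader` callback are hypothetical names, and the file is assumed to be a flat JSON object mapping op names to source paths.

```python
import json
import tempfile
from pathlib import Path


def replay_registry(registry_file, loader):
    """Load every recorded source path and drop entries whose file is gone."""
    registry_file = Path(registry_file)
    if not registry_file.exists():
        return []
    entries = json.loads(registry_file.read_text())
    kept, loaded = {}, []
    for op_name, source in entries.items():
        if not Path(source).exists():
            continue  # stale entry: the source was moved or deleted
        loader(source)  # caller decides how a path is actually imported
        kept[op_name] = source
        loaded.append(op_name)
    if kept != entries:
        # Rewrite the file only when something was actually cleaned up.
        registry_file.write_text(json.dumps(kept, indent=2))
    return loaded


# Demo: one live entry and one stale entry.
seen_paths = []
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "live_op.py"
    live.write_text("# a custom operator would live here\n")
    reg = Path(tmp) / "custom_op.json"
    reg.write_text(json.dumps({
        "live_op": str(live),
        "gone_op": str(Path(tmp) / "deleted_op.py"),
    }))
    loaded_names = replay_registry(reg, seen_paths.append)
    remaining = json.loads(reg.read_text())
```

Passing the loader in as a callback keeps the sketch independent of how modules are imported; here it simply records the paths it was asked to load.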